Foreword



Chester E. Finn, Jr. 



The No Child Left Behind Act is now just a month old, but it’s already yowling and 
a lot of people are as nervous about it as new parents, unsure whether to feed it, hug it, 
put it to bed or spank it. 

This is an enormous piece of legislation that possibly no human being has read from 
cover to cover. It spans dozens of programs and contains thousands of specific features. It 
ranges from Indian education to impact aid, from teacher quality to bilingual education, 
and on and on. 

In the seven policy papers that follow, as in the conference that the Thomas B. 
Fordham Foundation hosted on February 13, the focus is on the implementation of the 
part of the act that got the most attention, stirred the most controversy, is perhaps the 
most different from previous versions of E.S.E.A., and is probably fraught with the 
greatest uncertainty: I refer, of course, to the new requirements concerning state academic 
standards and testing programs, the intersection of state testing and national assessment, 
the tracking of yearly progress, and the various interventions, rewards and sanctions that 
are wrapped into what’s generally called “accountability” at the state, district and school 
levels. 

No matter what one thought of the President’s initial proposal (which I happened to 
like a great deal) or of the compromises and alterations that Congress worked in it (many 
of which I didn’t like nearly so much), NCLB is now the law and I expect that everyone 
wants it to work effectively in carrying out its stated purposes, which include boosting
student achievement, improving schools, giving people better information and closing 
some long-lasting and troubling gaps, so that, indeed, no child will be left behind. 

The standards, testing and accountability provisions are at the core of this hope and 
this promise. But they turn out to be complicated. And somewhat mysterious. We 
actually don’t know quite what is going to happen in the implementation of this law. 
That’s partly because Congress left many important decisions to the Education 
Department and to the states and we don’t know how they’re going to handle these; 
partly because we’re worried by the cautionary tales of weak and uneven implementation 
of past rounds of E.S.E.A.; partly because this is a country in which people hold different 
ideas of what constitutes good education and what’s reasonable to expect from schools; 
and partly because NCLB embodies an idiosyncratic set of compromises between what 
the fifty states have discretion to do differently and what must be done uniformly by all 
of them. To recall just the most obvious of many examples: under NCLB, states are free 
to set their academic proficiency bars wherever they like but, whether they set them high 
or low, and no matter where their students are today in relation to those bars, they all 
have the same twelve years to get all their children over those bars. 



These papers begin to explore such mysteries in the upcoming implementation of 
NCLB. We at the Fordham Foundation began the project with the premise that everyone 
wants it to work but that there’s no unanimity on how that can or should happen and 
plenty of reason to worry about things that could go wrong, come unstuck, not be done at 
all, be done badly, not be foreseen, etc. 

So we asked seven smart people (two of whom have co-authors, making for eleven 
smart authors) to examine some of these issues. We asked that their papers be written fast 
and kept short and accessible to ordinary readers. Most of the authors did most of those 
things. In fact, on the whole, they did a pretty terrific job. 

Which doesn’t mean they necessarily agree. There are interesting differences of 
view just among these seven papers. For example, Lisa Keegan and her colleagues are 
more bullish about what can be done with norm-referenced tests than Matt Gandal is. 

We don’t necessarily agree with them, either. I would come down differently on 
some issues. And some, indeed, are so intricate that another smart author, looking at the 
same issue, might have a different view of what the law provides and what the available 
data show. 

Some of these differences came out in the lively discussion at the February 13 
conference where, along with the authors, we were joined by five very able commenters, 
by the equally able Undersecretary of Education, Gene Hickok, and by 140 savvy and
engaged education policy watchers, participants, analysts and journalists. It was lively, 
probing and sometimes a bit contentious. We wish more people could have been there. 

The seven papers themselves provide much of the grist for that discussion, 
however, and they are worth the time and attention of anyone interested in the 
implementation of NCLB. It’s a moving target, of course. The Education Department is
already gearing up for “negotiated rulemaking.” Much is in flux. That’s why we 
concluded that getting these papers — some of them still working drafts — into cyberspace 
as quickly as possible would be more helpful than slowly trundling forth with a fully 
edited report of the traditional sort. 

Reader comments and feedback are cordially invited. Let me emphasize that we’re 
putting forth no “position” with these papers and have no political agenda. This is part of 
an earnest effort to begin reading the entrails of the No Child Left Behind Act in the hope 
that, if we understand them better, and are smart about what can and should and shouldn’t 
happen, maybe we can boost the odds that this will indeed work well for American 
children, especially the neediest among them. 

Chester E. Finn, Jr., President 
Thomas B. Fordham Foundation 
Washington, DC 
February 2002 



Multiple Choices: How Will States Fill 
in the Blanks in their Testing Systems? 

Matthew Gandal 



If someone had told me a couple of years ago that, over the next few years, every state 
was going to institute a grade-by-grade testing system, I would have laughed and thought 
that person was out of touch with reality and, frankly, politically naive. Most states 
hadn’t even established academic standards in each grade, let alone tests, and some were 
experiencing significant resistance from educators in the few grades where they were 
already testing. In a good number of states, moreover, policymakers did not believe 
grade-by-grade testing was necessary or desirable. Why would they all move to an 
annual testing system and how in the world would they pull it off? 

What I hadn’t considered was the confluence of events that would lead to the 
reauthorization of the Elementary and Secondary Education Act: a Republican president 
who believes in testing and accountability from a state that has shown that grade-by- 
grade testing can help raise achievement; his ability to get key members of his own party 
in Congress to stop viewing state standards and tests as an intrusion in local control of 
schools but rather a lever to improve them; and the leadership of key Congressional 
Democrats, who have come to see the power of standards and tests as a tool for achieving 
greater equity in American education and improving the life chances of the poorest 
children. 

Now that the legislation has passed and the bill has been signed by the president, the 
question remains: how are states going to pull it off? The new ESEA amendments 
require states to begin administering annual tests in grades 3 through 8 in reading and 
math by the 2005-2006 school year. The previous law required states to test all students 
in those subjects but only twice within that 6-grade span. Only 16 states currently have 
grade-by-grade tests in reading and math, and only 9 of those states have tests aligned 
with their standards (a requirement of the law). The rest will have to fill in the blanks 
with new tests. Achieve estimates that well over 200 new state-level tests will have to be 
created over the next several years to meet the new federal requirements. 

3 Big Questions 

States have made great progress over the last ten years in setting academic standards for 
students and communicating those expectations to schools and parents. Most states have 
also tried to align their assessment systems with their standards so that what they are 
testing becomes more transparent for educators and parents and so that whatever “stakes” 
are attached to the test results are matched by reasonable opportunities for children
actually to learn that which they’re being held responsible for knowing. There is still 
considerable room for improvement, to be sure. But the groundwork is in place in nearly 
every state. As states move forward to fill in the gaps in their annual testing system, it is 
critical that the quality of the new tests and their alignment with state standards not get 
sacrificed. 

Are States Ready? 

Are states ready to respond to this challenge? It’s too soon to be sure. Some states 
already have tests in all but one or two grade levels, so they only have to create a few 
new tests. But most states will have to more than double the number of tests they are 
now giving, and in doing so they will face both educational and political challenges (and 
incur financial costs as well). The educational challenges have to do with the quality of 
the tests and their usefulness in improving teaching and learning. This is something that 
states are already struggling with. The political challenges involve state and local control 
tensions and sustaining support from educators, parents, and business and community 
leaders. 

Optimally, states will view the federal legislation as an opportunity to take a fresh look at 
their standards, assessments, and accountability systems and do what it takes to 
strengthen them. The goal should not simply be to fill in the blank years with tests so 
that every student is being tested in every grade. Rather, the goal should be to 
intelligently craft an assessment system that provides teachers, schools, and parents with 
the data they need to focus attention and resources and achieve better results. 

Is the Market Ready? 

Directly related to the question of state capacity is the capacity of the testing industry. 
One of education’s dirty little secrets (made less secret last spring by a series of 
investigative reports by The New York Times) is that four major publishing companies 
have a virtual monopoly on the state testing market. While a few smaller firms have 
made some inroads over the last several years, the “big four” dominate this $700 million 
a year industry, creating and administering the tests in most states. 

This raises some urgent questions: do these few companies have the capacity to develop 
over 200 new tests in a very short period of time? The normal cycle for creating a new 
assessment in just one state is 2-3 years. This now needs to happen in two subject areas 
in multiple grade levels in at least 34 states! In order to meet this demand, will the
companies be forced to sacrifice their own (variable) standards of quality? Will they end 
up recycling old test questions and putting together hasty processes for creating new 
questions, thereby lowering the quality and sophistication of the assessments? 

Is the Public Ready? 

No matter how states approach the development of their new assessments, their greatest 
challenge by far will be sustaining the support of educators, parents, and the broader 
public as the new tests and accountability measures get rolled out. In poll after poll, 
parents, voters, taxpayers and opinion leaders have said they support testing, even
high-stakes testing, because it provides them with some assurance that schools are effectively
teaching and students are successfully learning. Educators have been less staunch in their 
support. They generally agree with raising academic standards, and acknowledge that 
tests are needed to measure achievement, but their support has begun to waver as real 
accountability measures have been put in place. 

State and local policymakers will need to be mindful of this as they contemplate how to 
fill in the gaps in their testing programs. Few educators relish the idea of adding more 
tests on top of those they already have. States will need to be strategic: as new state tests 
get added, duplicative local tests should be taken away. And educators are sure to pay 
attention to what the new tests are measuring. The narrower and less sophisticated the 
questions, the more we will hear complaints from teachers that they are being forced to 
water down — or narrow — their teaching and focus on a test-prep curriculum. 

The Challenge Ahead 

At its core, the new law challenges states to measure student achievement more often in 
order to ensure that students are progressing on a path to proficiency. The idea is not to 
wait several years before taking the students’ academic temperature, but rather to do it in 
every grade. More frequent testing leads to more frequent feedback to teachers, students 
and parents. And that feedback should allow schools to focus instruction where it is most 
needed and address achievement gaps for the benefit of all students. It is also intended to 
enable policy makers to intervene in situations where the testing reveals inadequate 
progress being made. 

There are, however, a number of challenges to making this work as conceived, and 
although the law lists some important criteria state assessments will need to meet, 
Congress has left many of the toughest decisions to the U.S. Department of Education and
to the states themselves. 

As states fill in the gaps in their testing systems, here are some of the things to watch out 
for: Will the new tests be adequately aligned to state standards? How challenging are 
those standards, and are they worth aligning to? Will the new tests be aligned with existing
tests, such that they measure a logical progression of skills from 3rd to 4th grade, from 4th
to 5th and so on through 8th grade? Will the tests be sufficiently challenging? Will they
measure advanced concepts as well as basic skills? Will the results be comparable across 
school districts within each state? How rigorous an approach will each state take to 
defining what it means to be “proficient”? How quickly and effectively will states report 
scores back to schools and households? Will states be mindful of the testing burden and 
work with districts to ensure that, as new tests get created, old ones head for retirement? 

The governors, business, and education leaders who attended the 2001 National 
Education Summit last fall anticipated many of these issues and committed themselves to 
a set of principles that, if followed, will lead to stronger assessment and accountability 
systems. States that successfully address these challenges will end up taking maximum 
advantage of the opportunities the new law affords. Those that do not may very well end 
up taking a step backward in their reforms. 

Testing Principles adopted at 2001 Summit:



Quality - State tests should be designed to measure student progress 
against clear and rigorous standards. Reports sent to schools and parents 
should indicate how students perform against the standards — not just 
how they compare with other students. Tests developed for other purposes 
cannot meet this need. The tests should measure the full range of 
knowledge and skills called for by the standards, from basic to most 
advanced. 

Transparency - In a standards-based system there should be no mystery 
about what is on the test. Students, parents, and teachers should know 
what is being tested. They should be confident that if students are taught a 
curriculum that is aligned with state standards, they will do well on state 
tests. The best way for states to ensure transparency is to publicly release 
questions from previous years’ tests, along with sample student answers at 
each performance level. 

Utility - Ultimately, it is the clarity of the results and the manner in which 
they are used that will make a difference in schools. Test results should be 
returned to schools and parents as quickly as possible without compromising 
the quality of the test instrument. Score reports should be clear, jargon-free, 
and designed to guide action. 

Comparability - The goal of state assessment programs is to create 
measurement systems that can accurately track and compare student and 
school progress from year to year. To accomplish this, the tests from one 
grade level to another must be aligned with state standards, and the results 
must be comparable from grade to grade so that student progress can be 
tracked from year to year. 

Coherence - State tests are only one piece of a comprehensive data 
system. Local and teacher-developed assessments are important too.
States must work with districts to ensure that all tests serve a distinct
purpose, redundant tests are dropped, and the combined burden of state 
and local tests remains reasonable. 

Strategic Use of Data - Closing the achievement gap can only occur if 
student achievement data is disaggregated by race and income, and if 
schools are required to show that all groups of students have made 
reasonable progress. By regularly reporting how every school is 
performing against state standards, states can focus attention on the 
problem, on the progress that some communities and schools are making 
in response, and on areas where additional work is needed. 

How Will States Respond? Four Scenarios

While ESEA lays down some clear markers on issues of academic standards, testing, and 
accountability, states have numerous options in determining how to fulfill the 
requirements. The Department of Education will either need to get much more concrete 
about what is expected or the states will end up determining the answers to these 
questions themselves. It is worth playing out several plausible scenarios to highlight the 
costs and benefits of the different approaches states might take. 

Scenario #1 — Cheap and Easy

It is more costly and time consuming to create new tests aligned with state standards than 
to take existing tests off a publisher’s shelf and assert that they are aligned. The fastest, 
cheapest way for states to fill in the gaps in their testing programs is to purchase ready- 
made tests such as the Stanford 9, Iowa Test of Basic Skills, and Terra Nova. These are in 
widespread use in schools today, but they are not designed to measure student attainment 
of any particular state’s standards. Rather, their main purpose is to compare one student’s 
achievement against that of other students in a national sample, in essence comparing that 
child against an average. 

Comparing pupil performance to an average or “norm” is very different than measuring 
whether or not that child has met a specific set of academic targets. The targets, or 
standards, provide something for students and teachers to aim for, and those standards do 
not fluctuate based on how other children are doing. 

Although it is not impossible for commercial tests to be well aligned with states’ 
standards, it is highly unlikely. In studies that Achieve has conducted for states, we have 
found that commercial tests typically touch on some standards but miss the mark on 
others. The pattern is that commercial tests tend to focus on what is easiest to assess, and 
it is often the most rigorous knowledge and skills that are not adequately measured. The 
result is a testing system that is out of sync with what states profess they want students to 
learn.

If, therefore, states opt to use “off-the-shelf” tests to fill in the grades where they do not
currently have tests, they will likely sacrifice the measurement of their standards in those 
grades. A combination of customized tests in some grades and off-the-shelf tests in 
others may also end up sending mixed signals to schools and parents about what students 
are expected to learn. If, for example, a state uses customized tests in 4th and 8th grades
and off-the-shelf tests in the other grades, the 4th and 8th grade teachers may end up
paying attention to the state standards because that is what is being tested, but the 
teachers in the other grades may pay less attention to the standards and more attention to 
what’s on the commercial tests. Imagine a school trying to organize its curriculum in 
such an environment; imagine teachers trying to collaborate across the grades; imagine 
parents trying to make sense of their children’s test scores from grade to grade. 

There is a twist on this strategy that a few states have pursued. In order to get a testing
system in place quickly, California began in 1998 by adopting a series of off-the-shelf 
tests for grades 2-11 (the Stanford 9) and then worked with the testing company 
(Harcourt Educational Measurement) to adapt or “augment” those tests over time to align 
better with the state’s own standards. Starting in 1999, California children began taking 
the augmented version of the tests, called “STAR” exams (Standardized Testing and 
Reporting System). These exams consist of a combination of questions from the Stanford 
9 and new test questions that were added to reflect the California standards. According to 
state officials, as many as 75% of the test questions in math had to be created from 
scratch to align with the standards; a smaller number of new questions were needed in 
English. 

Although education officials in California readily admit that their unorthodox approach 
caused confusion and even skepticism in schools across the state, they seem optimistic 
that their transitional strategy will result in tests aligned with their standards. Before 
other states consider trying this approach, though, it is worth a more careful look: Just 
how different are the “augmented” tests from the original ones? How well do they in fact 
align with the state standards (which, by the way, are among the most rigorous in the 
nation)? If they do, in fact, align well, how much of that has to do with the fact that 
California’s size and market share allowed it to push the testing company harder than a 
typical state could? Most states find that they have little leverage over these companies, 
but big states have greater influence due to the size of their student populations and the 
huge markets that get opened up for textbooks and other products. 

The truth is, alignment of tests with standards is difficult to achieve. Even states that 
have created their own tests from scratch have had a hard time measuring their standards 
well. But getting it right will be essential if the new assessments that states create are to 
add value to the existing ones, and become tools that teachers, parents, and policymakers 
can rely on to raise student achievement. Doing that well is not apt to be cheap. 

Scenario #2 — Leave it to Districts 

As state leaders have pondered how they’re going to fill in the grades where they 
currently do not have tests, some have said that they would rather let districts use their 
own local tests in the years when the state does not test. This is clearly the most 
politically convenient solution, as it sidesteps the state/local tensions and allows districts 
that already test students in grades 3-8 to leave those tests in place. It does, however, 
raise serious questions about the comparability of data across those districts. 

Formal studies by the National Research Council and informal studies by Achieve have 
concluded that it is nearly impossible to compare results of different tests in any 
meaningful way. This is because different tests measure different concepts and skills, so 
proficiency on one test rarely translates to proficiency on another. If states were to pursue 
this path of least resistance, therefore, they would likely sacrifice the ability to compare
achievement results across districts in the grade levels where the state itself does not test. 
How important is this to states? Will the lack of a common test in each grade skew the 
accountability system? Which tests will be factored into the adequate yearly progress 
formula: the state tests, the local tests or both? How can one provide cumulative results
for the state as a whole if the tests differ from place to place within it? Wouldn’t that lead 
to data that are very difficult to disaggregate? Will multiple tests send conflicting signals 
to schools as to where they should focus their curriculum and instruction? 

Scenario #3 — New Customized Tests 

In order to stay true to the principles of alignment, coherence, and comparability, the 
most desirable strategy for building an annual testing system is for states to develop new 
tests for the grades where they don’t have them. Those tests would be aligned both to
their academic standards and to the tests that they already have.

There are several different ways states might approach this. Some may choose to match 
the length and sophistication of their existing tests. Other states may decide to alter the 
format and length of their new tests. They may do this to reduce costs, to reduce the 
amount of time needed for students to take the tests, or to make the tests more diagnostic 
and useful to local educators. This is where a creative approach to the task could have 
the greatest educational payoff. 

Imagine a state that currently has reading and math tests in 3rd, 5th, and 8th grades, and
each of those tests is 90 minutes long and consists of a combination of multiple-choice 
and extended response questions (i.e., questions requiring written answers, such as 
essays). Confident in the data those existing tests provide and wary of the costs of 
producing identical tests in new grades, state officials might decide to create a shorter 
version for grades four and seven designed to provide a brief snapshot in between the 
other tests. The new tests might have fewer questions or rely more heavily on multiple- 
choice questions, and might only require 45 minutes of test-taking time. This approach 
would allow states with sophisticated assessments to maintain them at some grades while 
using more economical versions at other grades. 

Another approach might be to make the new tests as sophisticated as the existing tests, 
but to get creative in how they are scored. Indiana is one state considering this. The idea 
officials are exploring is to have classroom teachers score certain portions of their
students’ tests and to make the results immediately accessible to schools and parents. 
There would clearly be quality control and consistency issues that the state would need to 
work out, but in addition to saving money on centralized scoring, one of the benefits of 
this approach is that teachers would be much more invested in the assessment process 
and, therefore, may end up using the results in their classrooms. In fact, done right, 
grading state assessments could be a very effective form of professional development. 
Indiana is also exploring the development of formative assessments that teachers can 
voluntarily use at any point during the school year to determine how their students are 
advancing toward the state standards. 

However states approach the task of creating new tests, it is critical that they remain 
vigilant about test quality. Achieve’s work has revealed that even states that have created 
their own assessments for the purpose of measuring their own standards have had a 
difficult time getting it right. 

Scenario #4 — State Collaboration

When it comes to creating high quality tests worth teaching to and basing serious 
accountability systems on, the deck is clearly stacked against most states. High quality 
tests cost more to create and there is a limited pool of talent available to help them 
accomplish this. Given these tensions and the real pressure that states are under to get so 
many new tests in place relatively quickly, it is legitimate to ask why states need to go it 
alone. 

The most logical strategy for responding to the ESEA testing requirements is for states to 
pool resources and develop common assessments that they can share. This would allow 
states that do not have the market power of California, New York, and Texas to work 
together to leverage better quality tests. They are all relying on the same few companies 
to create these tests. Why not step back, form strategic partnerships, and leverage the 
situation? 

There are three reasons that states should consider doing this. The upsides are better 
quality tests, lower costs, and more comparable data across states since they will be using 
the same tests. The cost savings could be significant at a time when state budgets are 
tight and it’s not clear whether Washington is earmarking enough money to offset state 
testing costs. The comparability advantage also deserves more attention than it typically 
gets: one reason the legislation requires all states to give NAEP reading and math 
assessments every two years is that policymakers want better ways to compare results 
across states against a common standard. Why not build that comparability into states’ 
own assessment systems while they have the chance? This happens to be the reason 
some state policymakers and parents like the idea of using norm-referenced tests — it 
gives them some ability to compare results beyond their state. 

The new law specifically allows states to form consortia and pool resources to create and 
use common tests. The main thing standing in the way at this point seems to be habit. 
States are used to working individually with test publishers to create their own tests.
They are not used to a collaborative approach. This may change as states look ahead at
the need to build over 200 new tests. 

There is at least one consortium already in place that could be very helpful to states as 
they develop their ESEA strategies. At the request of governors and education 
commissioners in a number of states, Achieve launched an initiative in 1999 known as 
the Mathematics Achievement Partnership to help states work together to raise 
mathematics standards and achievement. Fourteen states are currently involved in the 
partnership, which will provide them with an internationally benchmarked 8th grade math
assessment, tools for improving the middle school math curriculum, and strategies for 
improving the professional development of middle school math teachers. We are 
exploring how states can tap into the consortium to develop tests in the grades where they 
currently do not have them. 

Getting It Right

The task ahead for states in building an annual testing system reminds me of what must 
be a fairly typical challenge facing city planners when they address changes in traffic 
patterns. Oftentimes, heavier usage on some roads necessitates adding stop lights at more 
intersections to control traffic and ensure safety. When confronted with the challenge of 
adding traffic lights at more intersections along a busy street, what would a thoughtful 
city planner do? Would he purchase the least expensive product even if the signals it sent 
were different than those of the existing traffic lights? Would he ask the residents on 
each block to build or buy their own traffic light? How would traffic be affected if the 
new signals were not timed with the existing ones? Would it help control the flow of 
vehicles or simply confuse and frustrate drivers and pedestrians? 

The thoughtful city planner keeps the endgame in mind as he devises his plan. The goals 
are safety and the smooth flow of traffic, not placing a traffic light at each intersection. 
That’s simply a means to the end. If poor decisions are made, it is quite possible that the 
addition of lights at each corner could make the streets more congested and less safe. 

It is the same with building an annual testing system. Approached intelligently, grade- 
by-grade testing can be a real improvement over what many states currently have in 
place. But not all strategies for creating annual tests will result in a coherent assessment 
system. States must take care to get it right. 

The President and Congress did make an effort to address some of the issues discussed in 
this paper. The law lays out a series of criteria that state assessment systems 
will need to meet. These include: alignment with state standards; reporting scores for 
each individual student; disaggregating the data by race, ethnicity, and socio-economic 
status; providing itemized analyses pointing to students’ strengths and weaknesses in 
each particular skill area; returning the results before the beginning of the next school 
year; and assessing “higher order thinking skills and understanding.” 

At this stage, the question on most people’s minds is how rigorous federal officials will 
be in their interpretation of these criteria and, more importantly, how serious they will be 
about enforcing them. Federal officials can and should play an important role in 
clarifying criteria and reviewing state plans, and if they take a hard line on some of these 
important issues, states could be left with a smaller but smarter set of options. 

If past experience is our guide, however, we should not expect the federal government to 
fully solve complex issues such as the quality, alignment, comparability, coherence and 
utility of state standards and assessment systems. The federal government can lay down 
clearer markers and use the bully pulpit, but in the end, these are issues that state leaders 
must address for themselves. 
Using NAEP to Confirm State Test 
Results: An Analysis of Issues 

Mark D. Reckase 



The new Elementary and Secondary Education Act (ESEA) amendments, “No 
Child Left Behind,” require that the National Assessment of Educational Progress 
(NAEP) reading and mathematics tests be administered every other year in grades 4 and 
8. Further, states must participate in the component of NAEP that is used to obtain 
estimates of students' academic performance at the state level. This part of the NAEP 
program is called State-NAEP. Participation in State-NAEP has been voluntary in the 
past, but the ESEA amendments make participation a condition of accepting Federal 
funds related to the legislation. While the legislation does not indicate what is to be done 
with the results of NAEP testing, it does imply that NAEP will be used as a check on the 
reading and mathematics assessment results reported by each state. Further, states will be 
required to administer their own reading and mathematics assessments to their students 
every year in grades 3 through 8. The purpose of this policy memo is to summarize the 
issues related to the use of NAEP to confirm the assessment results reported by states. 

Testing Programs in the ESEA Legislation 
A Brief Description of NAEP 

NAEP is an extensive program of data collection that includes achievement tests in 
a number of subjects, including, but not limited to mathematics and reading. NAEP also 
collects information about characteristics of the student population and features of the 
educational system. NAEP results, and the many interpretive reports produced from 
those results, provide an ongoing description of the functioning of the educational 
systems in the United States. 1 

NAEP tests differ from state assessments in a number of ways. 
First, the tests attempt to measure student capabilities (what students know and can do) 
on a domain of process and content knowledge that is common to the state educational 
systems across the United States. The creators of the document describing what is 
included in that domain also attempt to include content and processes recommended in 
future-oriented standards documents (e.g., those promulgated by the National Council of 
Teachers of Mathematics) so that the domain definition will be applicable for a number 
of years into the future. Allowing the national standards documents to influence the 
domain definitions implies that states are expected to move their curriculum in the 
direction of those standards. 



1 Details of features of NAEP are presented in a number of documents including Braswell, Lutkus, Grigg, 
Santapau, Tay-Lim and Johnson (2001). 
The domain of coverage for a NAEP subject matter area is described in a 
document called a “framework” (e.g., Reading Framework for the 1992 National 
Assessment of Educational Progress; NAGB, 1992). A consequence of the need for 
NAEP to be appropriate for assessing student performance in all states is that it cannot 
focus too closely on the educational goals from any one state. NAEP assesses the 
common core of all state programs, but it does not assess the instructional goals that are 
unique to individual states. 

A second way that NAEP is unique is that no student takes the entire test. Because 
NAEP endeavors to assess what students know and can do in a very broad domain, the 
full NAEP test contains a large number of questions (145 to 160 for NAEP 
Mathematics, for instance). This number of questions is too large for any student to 
attempt in a reasonable period of time. To keep thorough domain coverage, but also keep 
the testing time to a reasonable amount, each student takes only 36 to 45 mathematics 
questions. Test booklets contain overlapping sets of questions so that the results from all 
of the examinees can be combined to determine the expected distribution of performance 
on the full set of questions for the full sample of students. However, it is not possible to 
obtain a good estimate of performance on the full domain of knowledge and skills for any 
individual student because the student has responded to only a small part of the entire 
test. 
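
The booklet approach described above can be illustrated with a minimal sketch. The block count, questions per block, and the cyclic pairing rule below are hypothetical, chosen only so that the totals fall inside the ranges quoted in the text; they are not NAEP's actual booklet design.

```python
# Hypothetical design: 10 blocks of 15 questions = 150 questions in the full pool.
NUM_BLOCKS = 10
QUESTIONS_PER_BLOCK = 15
BLOCKS_PER_BOOKLET = 3  # each student answers 3 blocks = 45 questions


def build_booklets(num_blocks, blocks_per_booklet):
    """Give each booklet a run of consecutive blocks (wrapping around),
    so that neighbouring booklets share blocks with one another."""
    return [
        tuple((start + offset) % num_blocks for offset in range(blocks_per_booklet))
        for start in range(num_blocks)
    ]


booklets = build_booklets(NUM_BLOCKS, BLOCKS_PER_BOOKLET)
questions_per_student = BLOCKS_PER_BOOKLET * QUESTIONS_PER_BLOCK
pool_size = NUM_BLOCKS * QUESTIONS_PER_BLOCK

# Count how many booklets contain each block. Overlap is what lets results
# from different students be combined into one group-level distribution.
appearances = {
    block: sum(block in booklet for booklet in booklets)
    for block in range(NUM_BLOCKS)
}

# Fraction of the full question pool that any single student sees.
share_seen = questions_per_student / pool_size

print(f"Each student answers {questions_per_student} of {pool_size} questions")
print(f"Share of the pool seen by one student: {share_seen:.0%}")
```

Under these assumed numbers every block appears in three booklets, yet each student sees less than a third of the pool, which is why group-level estimates are feasible while individual scores are not.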



A third unique feature of NAEP is a direct result of the item and student sampling 
approach that it uses to keep testing demands within reasonable bounds. Because 
students take only part of the test, no student scores are reported. Also, tests are only 
administered to a random sample of students from the nation and from within 
participating states. A consequence of the sampling approach is that only estimated score 
distributions for state and national groups can be reported. NAEP summarizes the 
information from these distributions using percentages above achievement levels set by 
the National Assessment Governing Board (NAGB) and descriptive statistics (means and 
standard deviations). It is not possible to track an individual student’s performance on 
NAEP over years or directly compare student performance on NAEP with that student’s 
performance on a state test. Nor is it possible to report NAEP results at the school 
building level because only a small number of students from any school take the test, and 
those students take only part of the full set of test questions. 

The unique features of NAEP have not interfered with its use as a general indicator 
of the quality of education in the United States. However, they will need to be taken into 
account when NAEP results are compared to state results. 

State Assessments 



State assessment procedures are notable for the diversity of approaches that they 
take. Some states purchase existing tests from commercial test publishers as all or part of 
the state assessment program. This approach would seem to indicate that these state 
education officials believe that the commercial tests are sufficiently aligned with the 
curriculum and instruction goals for the state. Other states hire test development 
contractors to custom develop elaborate assessment programs according to state 
developed test specifications. The test specifications for these programs vary greatly. 
Some include performance assessment tasks that are scored by commercial companies, 
others are multiple-choice only, and some use computerized testing procedures as part of 
the assessment program. One state (Iowa) does not have a state assessment program, 
though most students in the state take the Iowa Tests of Basic Skills and Iowa Tests of 
Educational Development at some point in their schooling. 2 

The diversity of state assessment programs provides a challenge for the use of 
NAEP to confirm the results of those assessments. The state assessment programs have 
different content, schedules for administration, purposes, stakes, and technical 
characteristics. Further, many of these features will likely change in response to the 
ESEA legislation. At the very least, many states will have to increase the frequency of 
testing in grades three through eight in reading and mathematics. The next section of this 
memo highlights a number of the more important issues related to the use of NAEP for 
confirmation purposes. The following sections discuss the effects of differences in state 
testing programs on the interpretation of NAEP/state assessment comparisons. 

The Relationship between NAEP and a State Assessment 
Domain Overlap 

The starting point in the design of an achievement test is the specification of the 
domain of content and skills to be covered by the test. In theory, there should be a 
description of the domain at a level of detail that will allow an interested party to 
determine whether a specific test task should be on the test because it measures part of 
the domain, or whether it should be excluded because it does not. The NAEP framework 
documents are good examples of domain specifications. Unfortunately, like everything 
else with state assessment programs, descriptions of domains vary substantially across 
states. Some give very general statements of academic goals; others provide detailed 
descriptions of desired academic content and skills. 

A key to determining the comparability of NAEP and state assessment results is an 
evaluation of the commonality of the target domains. The following diagram gives a 
simplified representation of the overlap in those domains. The content domain for a state 
assessment program is represented by one circle and the domain for NAEP is represented 
by another circle. Within a circle is the content and skills to be measured by an 
assessment program. Outside the circles are the content and skills that are not included in 
the domains for either of the two assessment programs. For each assessment, there is part 
of the domain that is in common with the other assessment and part that is not. 




2 A brief summary of state assessment programs is given in Meyer, Orlofsky, Skinner and Spicer (2002). 
States vary in the amount that their assessment domains overlap with NAEP. For 
some, there is almost complete overlap. For others, the overlap is modest. 
Unfortunately, there do not seem to be any formal studies of the amount of overlap 
between domains for NAEP and state assessments. Such studies would be major 
undertakings that would require in-depth analysis of every state testing program. There 
would be a further complication that state assessment programs are not static - they 
change frequently, sometimes because of changes to the curriculum, but also because of 
other factors such as the need to reduce costs, or because of changes in educational policy 
within the state. 

Assuming that the amount and composition of domain overlap can be determined, a 
critical issue when comparing NAEP and state assessment results is the part of the 
domain that is emphasized by instruction within the state. Although a state may define a 
large content and skill domain as the focus of instruction, not all parts of that domain will 
be treated with the same emphasis in every classroom. If the focus of classroom 
instruction is on parts of the state's domain that do not overlap with the NAEP domain, 
then student performance may improve and be documented on the state assessment while 
that improvement is not shown on NAEP. NAEP might even show a decline if the part of 
the domain that is common to the assessment programs and the part that is unique to 
NAEP are given little instructional emphasis. 

To the extent that NAEP has captured the important outcomes of the nation’s 
educational systems, the cases of low domain overlap and of instructional focus on things 
not covered by NAEP should be rare. But it is possible that a state could show 
improvement, NAEP could show decline, and they could both be correct because 
instruction is focusing on different parts of the combined domain for the two tests. 
Performance Standards 



NAEP reports results in two ways. The first is estimated test score distributions on 
the NAEP standard score scale. This type of reporting includes mean scores for 
demographic groups and state samples. The second way that NAEP results are reported 
is percentages above achievement levels set by NAEP's governing body, the National 
Assessment Governing Board (NAGB). NAGB has set three such levels labeled basic, 
proficient, and advanced. The achievement levels are ranges between cut scores on the 
NAEP score scale. NAGB considers these cut scores as definitions of performance goals 
for what students should know and be able to do at grades 4, 8, and 12. 3 The NAGB 
achievement levels take on special meaning in the ESEA legislation because the 
legislation specifies that states must define their own “proficient” and “advanced” levels, 
as well as a “basic” level. The language of the legislation uses the same labels already 
used for the NAGB-developed achievement levels on NAEP. 

States also set cut scores on their assessments, but even when they use the same 
labels as the NAGB achievement levels the meanings of the state standards might be 
quite different. For example, a state may use the term “proficient,” but in terms of the 
number of students who attain that level or higher, the state’s proficient level may be 
similar to the NAGB “basic” level. Such differences in meaning of state and NAGB 
standards are not likely a sign of duplicity. The research on standard setting shows that 
different standard setting methods, different statements of policy, and standard setting 
panels with different characteristics are likely to produce different standards. 4 

The location of cut scores on a score scale is important because the location 
indicates where the reporting system will be sensitive to changes in student performance. 
Consider the following thought experiment. Suppose that a standard is set on a 
mathematics test by placing a cut score for reporting at roughly the level of difficulty of 
simple addition problems. Also suppose that at grade 4 in one school, the students are 
not yet doing well on addition, while at another school most of the students have 
mastered addition. In the first school, if instruction focuses on simple addition, many 
students will move from below the standard to above the standard. It is likely that the 
percent above the standard will improve quite dramatically. In the second school, 
however, because the students already know the material and because instruction is 
focused on other, probably higher level skills and knowledge (e.g., fractions), the 
increase in percent of students attaining that state's standard in that school will be small. 
The opposite effect can occur if the cut score is set at a level that is consistent with the 
difficulty of the fraction problems. In that case, the second school would show a lot of 
improvement and the first school would show very little. 
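
The thought experiment can be worked through with a small numerical sketch. Every number below is invented for illustration: the score scale, the two cut scores, the student scores, and the assumption that every student in both schools gains the same number of points.

```python
def percent_above(scores, cut):
    """Percentage of students scoring at or above the cut score."""
    return 100 * sum(score >= cut for score in scores) / len(scores)


def change(before, after, cut):
    """Change in the percentage at or above the cut, in percentage points."""
    return percent_above(after, cut) - percent_above(before, cut)


# Hypothetical scale: about 40 = simple addition, about 70 = fractions.
ADDITION_CUT = 40
FRACTION_CUT = 70

school_a_before = [30, 32, 35, 36, 38, 39, 41, 45]  # mostly not yet adding well
school_b_before = [55, 58, 62, 64, 66, 68, 72, 75]  # addition already mastered

GAIN = 6  # assumed: identical real improvement for every student
school_a_after = [score + GAIN for score in school_a_before]
school_b_after = [score + GAIN for score in school_b_before]

for name, cut in [("addition", ADDITION_CUT), ("fractions", FRACTION_CUT)]:
    a = change(school_a_before, school_a_after, cut)
    b = change(school_b_before, school_b_after, cut)
    print(f"Cut at {name}: school A {a:+.1f} points, school B {b:+.1f} points")
```

With the cut at the addition level, the first school gains 50 percentage points and the second gains none; with the cut at the fraction level, the first school gains none and the second gains 37.5. The underlying improvement is identical in all four cases, so the reported change depends entirely on where the cut score sits.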

The NAGB “proficient” level is a fairly high standard. Changes in the percent 
above that standard will likely reflect achievement gains for students whose instruction 
focuses on the more difficult NAEP content. Changes in the proportion above “basic” 
will likely show improvements for students whose instruction focuses on relatively easy 
NAEP content. 

3 For a discussion of the issues related to the standards set by NAGB, see Bourque and Byrd (2000). 

4 See Cizek (2001) for recent information on standard setting. 

Context of the Assessment 



Not only do NAEP and state assessments differ on domain coverage and the 
placement of performance standards, they also differ in the context for the assessment; 
that is, the way that the assessment is perceived by the students and the local school 
district staff. For example, some states use their assessments to determine whether 
students will be promoted to the next grade or whether school staff will receive monetary 
awards for helping students reach instructional goals. These assessment programs are 
called “high stakes” because there is a direct and important consequence to the students 
and school staff. In such cases, it is likely that students will be motivated to do well and 
the school staff will do what they can to help the students perform at their best. 

The amount of “stakes” for state assessments varies quite dramatically. Some states 
use the assessment results only for general school accountability purposes with no direct 
consequences for students. Some states test a sampling of students rather than every 
student. Other states make the assessments a very important part of the state instructional 
system. Teacher salaries may depend on the assessment results and students may receive 
direct rewards or punishments. The high level of variability across states with regard to 
“stakes” adds to the complexity of comparing state results with NAEP results. 

NAEP has no direct consequences for students or school staff because NAEP 
results are not reported at the school or student level. Students do not receive scores and 
schools do not receive summaries of student performance. These features of NAEP make 
it a “low stakes” assessment at the school and student level. The differences between 
contexts for state assessments and NAEP need to be taken into account when interpreting 
comparative results. 

Analysis 

When comparing state assessment results with NAEP results for a single curriculum 
area, there are nine possible results as depicted in the cells in the following table. NAEP 
confirming state results would seem to require that both testing programs have results in 
the cells with the Xs. The question of concern here is "How likely is it that NAEP and 
state assessments will give results in these cells?" To answer this question, all of the 
issues that have been summarized need to be considered. 



                                  NAEP 
State Assessment     Decline    Stay Level    Increase 

Decline                 X 
Stay Level                          X 
Increase                                         X 



First, the issue of domain overlap needs to be considered. For most states, the 
domain overlap between NAEP and the state assessment will be at least moderate. 
NAEP was designed to measure the common content of the instructional systems of all of 
the states. Unless a state has instructional goals that are notably different from those of 
other states, there should be some commonality between domains of coverage for NAEP 
and a state assessment. However, it is not likely that the overlap will be total for any 
state. It is possible that there may be important parts of a state domain that are unique to 
the state and not included in the content of NAEP. If the state focuses instruction and 
assessment on the unique features to the exclusion of the common components, it is 
possible for the state assessment to show gains when NAEP does not. It is also possible 
for NAEP to show gains when a state assessment does not if instruction focuses on the 
unique features of NAEP (e.g., instruction may be focused on national curriculum 
standards) rather than the unique features of the state assessment. This seems less likely, 
but possible. The existence of these possibilities suggests that part of the interpretation of 
NAEP results for confirming state results will need to be a judgment of the overlap 
between the assessment domains. Substantial overlap makes NAEP a stronger tool for 
confirmation. Low overlap indicates that NAEP cannot provide solid evidence for 
confirmation or disconfirmation. 

Second, the context of the state assessment will also likely affect the usefulness of 
NAEP as a source of evidence for confirmation. If the state assessment is high stakes and 
NAEP is low stakes, students may try very hard on the state assessment and not very hard 
on the NAEP. Real situations may be more complicated. There are more possibilities 
than motivated and not motivated. Students vary in level of motivation and the level of 
student motivation may interact with the level of difficulty of items. Students may give a 
reasonable level of effort to easy items even when the test does not count for them, but 
they may give up on hard items when the test does not have direct consequences. The 
result of differences in stakes may be that students show improvement on the state 
assessment if it is high stakes and no improvement or a decline on NAEP. 

The context of state assessments and NAEP may differ in other ways that may 
affect the comparison of results. The assessment programs may be administered at 
different times of the year. If the state assessment is administered in the fall, and NAEP 
is administered in the spring, the amount of exposure to the curriculum will differ. The 
differences in instructional time will influence the amount that students have learned by 
the time the test is administered and the amount of gain that can be detected. The quality 
of the assessments may also differ, affecting the confidence that can be placed in the 
reported results. 

The location of standards on the assessment can result in similar differences in 
results. Students at all points in a distribution of performance will not likely improve by 
equal amounts. If a school focuses on the improvement of basic skills, performance 
standards set at a relatively low level will show the greatest change in the percent 
attaining those standards. The NAGB “proficient” level is a high standard so it may not 
be sensitive to changes in basic skills. A basic skills oriented state standard might show 
improvement while the percent above NAGB “proficient” does not. The opposite may 
occur for schools focusing instruction at a higher level - NAEP may show changes when 
the state assessment does not. 

A solution to this problem is to look at changes at all levels of student achievement 
rather than at single cut scores. NAGB is currently investigating reporting procedures for 
NAEP that can show changes along the entire NAEP score scale. These same procedures 
could be used by states as well. 

The description of state and NAEP assessment programs given here is based on the 
current characteristics of those programs. However, the legislation will likely result in 
significant changes to both NAEP and state assessments. A recent review of state testing 
programs in Education Week indicates that only eight states currently meet the 
requirements set out in the legislation. Many states will have to expand their reading and 
mathematics assessments to meet the requirement of testing every year from grade 3 to 
grade 8. NAEP will also have to change its testing schedule to provide results every 
other year in mathematics and reading. While it is likely that significant changes in these 
assessment programs will occur, the full impact of the changes will not likely be 
understood for several years. 



Conclusions 

Jointly interpreting state assessment and NAEP results in a coherent way will not 
be a simple task. Many factors need to be taken into account when making such 
interpretations including the amount of content overlap, the location of cut scores on the 
score scales, and the context for the assessments. This is not to suggest that the joint 
interpretation of the test data is impossible or unwise. Experience from analysis of ACT 
and SAT college admissions tests and other testing programs indicates that tests 
constructed from different test specifications can yield highly correlated results. It is 
likely that NAEP results and state assessment results will be related as well. With careful 
consideration of threats to accurate interpretations and realistic judgments about the 
amount of effort that will be required to make accurate interpretations, joint use of NAEP 
and state assessment results should lead to better understandings of the functioning of the 
educational systems in the United States. 
Adequate Yearly Progress: 

Results, not Process 

Lisa Graham Keegan, Billie J. Orr & Brian J. Jones 

When President Bush signed the No Child Left Behind Act of 2001 (NCLB) into law on 
January 8, 2002, he brought to the public school system a new demand. All students — 
regardless of race or socioeconomic status — must be held to the same academic 
expectations, and all students — regardless of race or socioeconomic status — must have 
their academic progress measured using a newly-refined concept of adequate yearly 
progress (AYP). 1 

The term AYP should be nothing new to educators. Title I of the previous version of the 
Elementary and Secondary Education Act, the Improving America’s Schools Act (IASA) 
of 1994, introduced the concept of adequate progress in its requirements that all states 
establish academic content standards, develop tests to assess student progress in those 
standards, and create performance standards for those tests. But the focus of the 1994 
law centered much more on the process of building the AYP mechanism that would be 
used to measure achievement in Title I schools and for Title I students than it did on 
ensuring actual academic progress for all students. Consequently, most states have dual 
accountability systems in place — one for Title I schools and another for all public 
schools. In 2000, only 22 states had a single, unified system to judge the performance of 
all public schools. 2 

With NCLB, all this changed. The play is no longer the thing; success in complying with 
the law will no longer be based upon whether a state has created academic standards and 
testing, but rather on how well all of its students are doing in making real progress toward 
meeting those standards. That means testing all students, and it means using the same 
system for all students; thus NCLB requires states to use a single accountability system 
for all public elementary and secondary schools to determine whether all students are 
making progress toward meeting state academic content standards. 

This expectation defined by NCLB — that all children will make continuous progress 
toward proficiency on state standards — is the underlying motive behind the new AYP. 

The goal is to ensure that all students, regardless of what they look like or how much 
money their parents earn, make adequate yearly progress, period. “All students can 
learn” is no longer just a mantra, it’s a goal that will be measured every year. 

The AYP process sounds relatively straightforward: States set the bar for what is deemed 
“proficient” in relation to their academic standards. They must then define what level of 



1 No Child Left Behind Act, P.L. 107-110, 107th Congress, 1st Session, 2001. 

2 Margaret E. Goertz and others, “Assessment and Accountability Systems in the 50 States: 1999-2000” 
(University of Pennsylvania: Consortium for Policy Research in Education, 2001), 30. 
improvement will be sufficient each year to determine not only whether districts and 
schools have made “adequate yearly progress” toward meeting the standard of 
proficiency, but also the rate at which they will get all students to proficiency in twelve 
years. Finally, after testing students each year, states will disaggregate the testing results 
to determine how specific populations of students are achieving at the state, district, and 
school levels, and make those results available to the public. This is simple in 
description, but complicated in execution — and, ultimately, central to the law. AYP is 
used throughout NCLB to determine compliance, rewards, and sanctions. Process is not 
enough; it’s results that count. 

Precisely how we define results — even when it comes to such seemingly simple tasks as 
defining terms like proficient or adequate — will be decided in collaboration with the U.S. 
Department of Education and the states. While this law gives strong guidance, we would 
all do well to approach this collaborative process with humility. State accountability 
systems that seek to ensure the academic success of all students are still relatively new 
and unstudied phenomena. Our experience to date has given us much confidence that the 
broad infrastructure of NCLB is sound, but there is still much to learn and many ways to 
approach the requirements of this new law. 

Defining a System: “Specific Ambiguity” 

Under NCLB, Congress provided the states with significant flexibility in developing 
state accountability systems, and with greater flexibility in general program 
administration than has previously been permitted in federal education law. For example, 
state and local education agencies will be allowed for the first time to shift up to 50 
percent of their non-Title I administrative funds between programs, or they may even 
shift these funds into Title I itself (though they cannot move funds out of Title I to other 
accounts). States can also apply to receive “flexibility authority,” which will be awarded 
to seven states on a competitive basis to demonstrate even greater gains with greater 
freedom. 

Consistent with this new flexibility, while the objectives of the AYP requirements in 
NCLB are obvious as general guidance, they leave a great deal of room for interpretation 
in their specific implementation. For this reason, the U.S. Department of Education will 
be issuing further instruction on many of the details of the law. We would advise those 
involved in the rulemaking and guidance process to proceed cautiously, for the very 
vagueness of the law — this “specific ambiguity” — is actually an asset, as it leaves each 
state room to experiment within its own strengths and limitations. Rulemakers should not 
eliminate the desired and intentional ambiguity of the law; rather, they should jointly be 
seeking ways to learn from it. As Thomas J. Kane noted in an analysis of the House and 
Senate AYP proposals, 

...states are currently experimenting with a wide range of different 
types of accountability systems. They should be allowed to 
continue experimenting, until the Nation reaches a consensus 
regarding the ideal way to determine which schools are making 
adequate yearly progress and which are not.... [I]mpatience is an 
insufficient excuse for bad education policy. 3 

While NCLB defers in certain respects to state policies and practices, it does lay down 
some non-negotiable directives that states must adhere to in their efforts to develop an 
AYP process. One might compare this to a road map on which main thoroughfares and 
destinations are clearly marked, but unmarked side streets and alleys are also open to 
travel along the way. 

Under the law, each state is required to work with its teachers, parents, principals and 
local educational agencies to create a state plan that incorporates challenging academic 
content standards and student achievement standards that apply to all children within the 
state. The academic achievement standards (formerly called performance standards) 
must describe basic, proficient and advanced levels of achievement. As stated 
previously, this is crucial to understanding the concept of AYP, because the goal is for all 
children to reach the proficient level (or beyond). The state must also implement a single 
accountability system that ensures that its schools, districts and the state as a whole make 
adequate yearly progress. 

Further, while each state is responsible for the specifics in defining how it will determine 
“progress,” the federal law is clear that the state’s definitions of AYP must have the 
same high standards of achievement for all public schools in the state, and they must 
follow a 12-year timeline for getting all students to proficiency. The state’s criteria must 
be statistically valid and reliable, require continuous and substantial improvement for all 
students, and measure progress based on state reading and mathematics tests. Secondary 
schools must include graduation rates as a factor in determining progress, and elementary 
schools must use one additional indicator such as attendance, promotion rates or 
increases in participation in advanced classes. 

Data from the 2001-2002 school year will establish the starting point for measuring the 
percentage of students meeting or exceeding the state’s level of proficiency. Each state must 
set the initial bar at a level based on either its lowest-achieving demographic group or the 
scores of its lowest-achieving schools, whichever is higher. However, regardless of 
where the initial bar is placed, states must define AYP so that all students in all groups 
are expected to improve and achieve the proficiency level in 12 years. 4 The law is 
specific in this goal, but ambiguous in the starting point, deferring to the states for the 
criteria they will use for the initial placement of the bar. 

Once the starting level has been determined, states must then begin raising the bar, 
increasing the share of students expected to meet or exceed the state’s level of 
proficiency, with the goal being 100% of students at proficiency in 12 years. 
The statute requires that the bar be raised in equal increments over time; it must be 
raised for the first time not later than two years into the process, and then again at least 




3 Thomas J. Kane and others, “Assessing the Definition of ‘Adequate Yearly Progress’ in the House and 
Senate Education Bills” (Los Angeles: School of Public Policy and Social Research, UCLA, 2001), 12. 

4 No Child Left Behind Act, P.L. 107-110, Section 1111(b)(2), 107th Congress, 1st Session, 2001. 


once every three years. Where states have leeway is in determining the initial “height” of 
the bar, and the rate at which it will be raised over time until 100% of students reach 
proficiency. 
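
The bar-raising rules described above lend themselves to a short worked example. The Python sketch below is illustrative only: the function names, the baseline figures, and the particular raise years are hypothetical, and it assumes that "two years into the process" means year 2 counted from the baseline year. It is one reading of the rules as summarized here, not the statutory formula.

```python
def starting_bar(lowest_group_pct, lowest_schools_pct):
    """Initial bar: the higher of the two baseline figures named in the text."""
    return max(lowest_group_pct, lowest_schools_pct)


def ayp_targets(start_pct, raise_years, deadline=12):
    """Return {year: required percent proficient} for years 0..deadline.

    raise_years lists the years (counted from the baseline year, which is
    an assumption of this sketch) in which the state raises the bar.
    """
    steps = sorted(set(raise_years))
    if not steps or steps[0] < 1:
        raise ValueError("raise years must be positive")
    if steps[0] > 2:
        raise ValueError("first increase must come within two years")
    if any(later - earlier > 3 for earlier, later in zip(steps, steps[1:])):
        raise ValueError("increases must come at least once every three years")
    if steps[-1] != deadline:
        raise ValueError("the final increase must land on the deadline year")

    increment = (100.0 - start_pct) / len(steps)  # equal increments
    targets = {}
    raises_so_far = 0
    for year in range(deadline + 1):
        if raises_so_far < len(steps) and year == steps[raises_so_far]:
            raises_so_far += 1
        targets[year] = start_pct + increment * raises_so_far
    return targets


# Hypothetical state: baseline figures of 32 and 40 percent, five raises.
schedule = ayp_targets(starting_bar(32.0, 40.0), [2, 5, 8, 11, 12])
```

Under these assumptions the bar sits at 40 percent until year 2, rises by 12 points at each of the five chosen years, and reaches 100 percent in year 12. A state could equally choose annual raises; the leeway lies in the starting height and the pacing.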

Finally, to ensure that the most disadvantaged students do not get left behind in this 
process — so that states and schools don’t get the more affluent children to proficiency 
first, then go back and start working on at-risk children in the waning years of the 12-year 
deadline — states must include separate measurable objectives for “continuous and 
substantial improvement” in both reading and math for students who are minorities, poor, 
disabled, or of limited-English proficiency (LEP). This is how states can monitor how 
well they are doing in closing the achievement gap. 

The bottom line is that, in order to demonstrate adequate yearly progress, the state and its 
districts must show that schools are meeting or exceeding the state annual measurable 
objectives for all students and for students within each subgroup. 

It is important to note that there is also a “safe-harbor” provision within NCLB, 
created to address the concern that too many schools would be identified as failing simply 
because one subgroup (for example, LEP students) failed to meet the state AYP goals. 
This provision allows schools to avoid being considered as failing so long as (in this 
particular example) the number of LEP students who are below proficiency decreases by 
10 percent when compared with the preceding year, and LEP students also make 
progress on one or more of the additional academic indicators listed above. The law also 
requires that at least 95% of students enrolled in the school and in each subgroup take the 
state tests in order to meet the standards of AYP. 5 
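
The subgroup test, the safe-harbor exception, and the participation requirement can be combined into a single check. The sketch below is a hypothetical illustration: the function and parameter names are invented, and it assumes the 95 percent participation rule applies to both the ordinary route and the safe-harbor route, which the summary above implies but does not spell out.

```python
def subgroup_meets_ayp(proficient, tested, enrolled, target_pct,
                       below_prev, below_now, improved_other_indicator):
    """Decide whether one subgroup counts as making AYP in a given year.

    Counts are whole numbers of students; target_pct is the state's
    annual measurable objective as a percentage.
    """
    # Participation: at least 95% of enrolled students must take the test.
    if enrolled == 0 or tested * 100 < enrolled * 95:
        return False
    # Ordinary route: percent proficient meets the annual objective.
    if proficient * 100 >= target_pct * tested:
        return True
    # Safe harbor: the below-proficient count fell by at least 10 percent
    # from the preceding year, with progress on another indicator as well.
    shrank_enough = below_now * 10 <= below_prev * 9
    return shrank_enough and improved_other_indicator


# Hypothetical LEP subgroup: 100 enrolled, 97 tested, 30 proficient,
# against a 50 percent objective; 80 were below proficiency last year.
lep_ok = subgroup_meets_ayp(30, 97, 100, 50, 80, 67, True)
```

In this made-up case the subgroup misses the 50 percent objective but still qualifies, because the number of students below proficiency fell from 80 to 67 (more than 10 percent) and the other indicator improved.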

As an external audit for states to gauge the quality of their own standards — to give them 
some idea of how high their bar for proficiency is set and how well they have defined 
progress toward that bar — states will be required every other year to administer the 
National Assessment of Educational Progress (NAEP) tests in reading and math. This is 
not only a significant change from prior law (where NAEP was optional and administered 
only once every four years) but a critical one. NAEP results will act as both light and 
leverage for states serious about taking a closer look at their standards and making any 
necessary modifications to ensure that they remain rigorous. 

What will an ideal system look like? Frankly, we’re not sure yet. Clearly, states will 
develop a single accountability system for all students, create definitions of progress that 
fall within federal parameters, and lay out a timeline for getting all students to 
proficiency in 12 years — and there end the details. Through NCLB, the federal 
government has said, “Here are the guidelines, the flexibility, the resources, and the 
expectations. We’ll meet you back here in 12 years, and we’ll provide you with an 
external audit through NAEP every other year, but we want 100% of your students at 
proficiency or higher.” In the meantime, states should take advantage of the specific 
ambiguity in the law and build the system that works best for them. 



5 No Child Left Behind Act, P.L. 107-110, Section 1111 (I), 107th Congress, 1st Session, 2001. 



Building a System: Norm- vs. Criterion-Referencing 



It is likely that the goals of AYP will be realized in ways that have not been pursued on a 
national basis, but which will be diligently pursued in individual states. Therefore, we 
would advise caution in overseeing developing systems: regulators should not hasten to declare 
them insufficient in process so long as the outcome data they seek and produce match the 
goals and objectives of the law. Remember, this is about results, not process. 

Accountability systems are still a new science. Few have been well researched. Many 
exist on paper, though few have been employed over any significant period of time. For 
this reason, educators, testing directors, and federal officials engaged in “approving” a 
given approach would be well advised to gather all of the pertinent data currently 
available. We may be in for a few surprises. 

As an example, we hear a compelling and well-reasoned argument that the best method 
for testing students is to use a criterion-referenced test that has been tailor-made to 
directly correlate to a state’s specific standards. If that argument is universalized as a 
compliance requirement of NCLB, every state that has not yet done so must commission 
the development of a specialized criterion-referenced test for use every year, rather than 
use any number of pre-existing commercial tests. 

The argument for this approach says that only tests designed specifically around a state’s 
standards can adequately reflect student progress toward those standards. Or so current 
accountability theory seems to suggest. 

Theory is one thing, but we may miss potentially powerful state approaches if this theory 
dictates all future practice. In fact, requiring each state to develop an annual 
criterion-referenced test will immediately undermine extensive efforts already underway in states 
such as California, Arizona, and Tennessee, among others. These states currently use 
norm-referenced tests or test items to gauge academic progress down to the level of an 
individual student, and what they have found bears further study. 

Some of their preliminary data suggest that this method of analyzing student achievement 
yields data comparable in quality and outcome to those derived from analysis of 
criterion-referenced tests. Until there is sufficient research in this area by those who know testing 
systems best, we should avoid dismissing the use of norm-referenced tests at the outset of 
this endeavor. 

A quick look at Arizona’s testing data should show why. Arizona administers both a 
criterion-referenced test (the AIMS test, shown on the left in Figure 1) and a 
norm-referenced test (SAT-9, shown on the right). If we lay the results of these two tests 
next to each other, understanding that there are technical differences in the 
administration of the tests that make a perfect correlation impossible, the results are still 
remarkably similar. 6 



6 In this particular case, percentile scores have been converted to normal curve equivalents for a more valid 
comparison of criterion- and norm-referenced test scores. (See above explanation in text.) 

Figure 1. Results from Arizona’s criterion-referenced test (on the left) and 
norm-referenced test (on the right) are remarkably similar. 

It can, of course, be argued that a criterion-referenced test is more precisely matched to 
the state’s specific standards. We don’t disagree. Yet, norm-referenced tests are also 
based on a publicized set of standards, and these are generally consistent with those used 
for criterion-referenced tests. Bear in mind the goal of showing progress — a gain in 
knowledge of material deemed most essential for student success. Both a 
criterion-referenced and a norm-referenced test are made up of questions designed to make an 
effective judgment of student knowledge and skills in defined areas. Where they differ 
most significantly is presumably in their range of difficulty. 

While a norm-referenced test uses questions chosen to elicit a bell-shaped performance 
curve, the criterion-referenced test is made up of questions meant to match the standard. 
For norm-referenced tests, results are displayed primarily in a percentile ranking scale for 
comparison to other students, based on a nationwide “norming” population. However, 
most national norm-referenced tests also offer conversion of their percentile scores into a 
curve representing points given for every correct answer. As the Arizona data show, 
curves and performance levels for the converted norm-referenced tests nearly mirror 
criterion-referenced test results. 
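
The note to the Arizona comparison mentions converting percentile scores to normal curve equivalents (NCEs). The standard NCE scale has a mean of 50 and a standard deviation of 21.06, chosen so that NCEs of 1, 50 and 99 line up with the 1st, 50th and 99th percentiles. The sketch below shows that general conversion in Python; it is not a reconstruction of the procedure actually used on the Arizona data, which the text does not describe, and the function name is hypothetical.

```python
from statistics import NormalDist


def percentile_to_nce(percentile):
    """Convert a percentile rank (strictly between 0 and 100) to an NCE."""
    if not 0 < percentile < 100:
        raise ValueError("percentile must be strictly between 0 and 100")
    # z-score of the percentile under a standard normal distribution
    z = NormalDist().inv_cdf(percentile / 100.0)
    # Rescale to mean 50, standard deviation 21.06
    return 50.0 + 21.06 * z


# Percentile ranks bunch up near the middle; NCEs are equal-interval,
# so a student at the 75th percentile has an NCE of roughly 64, not 75.
nce_75 = percentile_to_nce(75)
```

Because NCEs form an equal-interval scale, they can be averaged and compared across tests in a way that raw percentile ranks cannot, which is why such a conversion is a sensible precondition for the side-by-side comparison.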

An additional point bears mentioning. Based on his work in Tennessee over the past 15 
years, Dr. William Sanders offers the opinion that we do not need to have an 
excruciatingly tight match of state standards to specific test items. In fact, he places far 
more importance on “freshening” a test annually with new items than he does on specific 
linking to a particular standard. 7 It could well be that we have placed too much emphasis 
on states writing their own unique tests. This is yet another assertion that deserves 
additional study. 

We are not arguing that criterion-referenced tests and norm-referenced tests are 
interchangeable. They are designed for different purposes and with distinct strengths and 
weaknesses, but the assumption that a state-developed criterion-referenced test better 
identifies student growth than a norm-referenced “test off the shelf” may not withstand 
in-depth analysis. The data produced by both norm- and criterion-referenced tests are so 
strikingly similar that an automatic preference for use of a criterion-referenced test to 
gauge student progress as part of NCLB seems unwarranted for the moment. 

A final word in this regard: Those of us who support NCLB clearly believe that the core 
set of knowledge we seek for our students is sufficiently similar as to be assessable with a 
more generalized examination — otherwise, why the prominent role of the National 
Assessment of Educational Progress (NAEP) as an external audit for states in the new 
law? One cannot argue that gain can only be viewed within the confines of unique state 
assessments while simultaneously extolling the ability of NAEP to judge achievement 
across the board. 



7 Education Commission of the States. A Closer Look: State Policy Trends in Three Key Areas of the Bush 
Education Plan — Testing, Accountability and School Choice. (Denver: Education Commission of the 
States, 2001), 8. 

The conclusion? We need more comparison and research regarding what these tests tell 
us. There are presently a number of states that not only use both norm- and 
criterion-referenced tests, but also use them in different subjects, different grades, and, in 
some cases, in different locations around their state. Equating the results of this blend of 
norm- and criterion-referenced testing may be valid — and then again it may not. Until 
we have more data from the administration of these tests, and the opportunity to look at 
these data in a meaningful way, we ought not be in a hurry to junk the use of 
norm-referenced tests. Educators should currently worry less about whether a test is norm- or 
criterion-referenced, and concentrate instead on its relationship to state goals, and to 
collecting and analyzing the results of those tests in meaningful ways. We’re looking at 
progress, not process. 



High Stakes and Consequences 

AYP requires states to disaggregate test results not only by communities and schools but 
also by specific sub-groups of students. Such disaggregation gives educators and parents 
a truer idea of what is really going on in their school — after all, a school that appears to 
be making progress when one looks at its average score may also show, upon closer 
examination, that certain groups of students have made little or no gains. Disaggregation 
of results is a necessary tool of accountability to ensure that schools do not hide failing 
groups of students behind the law of averages. 
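
A small worked example shows how a school-wide average can mask a struggling subgroup. The data, the 60 percent target, and the function name below are all hypothetical, chosen only to illustrate the arithmetic of disaggregation.

```python
from collections import defaultdict


def percent_proficient_by_group(records):
    """records: iterable of (subgroup, is_proficient) pairs.

    Returns percent proficient for each subgroup and for all students.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [proficient, total]
    for subgroup, is_proficient in records:
        for key in ("all students", subgroup):
            counts[key][1] += 1
            if is_proficient:
                counts[key][0] += 1
    return {key: 100.0 * p / n for key, (p, n) in counts.items()}


# Hypothetical school: 80 non-LEP students (64 proficient)
# and 20 LEP students (6 proficient).
records = ([("non-LEP", True)] * 64 + [("non-LEP", False)] * 16
           + [("LEP", True)] * 6 + [("LEP", False)] * 14)
rates = percent_proficient_by_group(records)

target = 60.0  # hypothetical annual objective
missed = sorted(group for group, rate in rates.items() if rate < target)
```

Here the school as a whole is 70 percent proficient and clears the hypothetical target, yet only 30 percent of its LEP students are proficient. Without disaggregation, that gap would never surface.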

So, what happens if students in a school or in a particular subgroup do not meet or exceed 
the state’s defined standard for AYP? The answer is simple: that school would not make 
adequate yearly progress. NCLB is very clear about the consequences that such 
schools will face, and the stakes are high. 

If schools and districts do not show gain over a defined period of time, action will be 
taken on behalf of the students in those schools, including mandatory public school 
choice and the provision of individual supplemental services purchased with Title I 
funds. In addition, chronically failing schools face the very real possibility of being 
completely restructured, while states that fail to meet their obligations under 
their state plan risk the loss of federal administrative dollars. 

These potential penalties resonate loudly with schools, districts and states, and they send 
a clear message to parents that the law is serious about providing them opportunities to 
remove their children from consistently failing schools. In a welcome break with past 
policy, school failure will result in meaningful consequences, and will empower parents 
to immediately remove their children from failing schools, instead of consigning them to 
continued failure. Further, in contrast to the overall mood of NCLB, the timelines and 
sanctions imposed for school failure are specific and non-negotiable, as they should be. 
There is simply no more room for flexibility when it comes to consequences for failing 
schools. 

If a school fails to make adequate yearly progress for two consecutive years, it will be 
identified by the district and state as needing improvement. This identification will mean 
that federal funds will be available to states and districts to provide schools with technical 
assistance to improve academic achievement, but financial assistance alone is no longer 
seen as a sufficient tonic for the ailment. The school is also subject to stricter and more 
rigorous sanctions to ensure that change occurs as quickly as possible. After two years of 
failure, the district is required to create a plan to turn the school around and to offer 
public school choice to all students in the failing school by the beginning of the next 
school year. Further, the district must pay the costs of transporting any students who opt 
to attend a different public school, including public charter schools. 

If a school fails to make adequate yearly progress for three consecutive years, it must not 
only continue to offer public school choice for all students, but must also allow 
disadvantaged students in the failing school to use Title I funds to pay for supplemental 
services from a provider of choice. Schools will be required to set aside 20 percent of 
their total Title I allocation to pay for both the supplemental services and transportation to 
these services. Not less than 5 percent must be used for each. 
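
The set-aside arithmetic can be made concrete. The sketch below assumes, as a simplification not stated in the text, that supplemental services and transportation together use exactly the 20 percent set-aside, which implies that neither can exceed 15 percent. All names and dollar figures are hypothetical.

```python
def set_aside_limits(allocation_dollars):
    """Dollar limits implied by a 20 percent set-aside with 5 percent floors."""
    total = allocation_dollars * 20 // 100
    floor_each = allocation_dollars * 5 // 100
    return {
        "total_set_aside": total,
        "minimum_each": floor_each,
        # Follows from the assumption that the two uses sum to the total.
        "maximum_each": total - floor_each,
    }


def split_is_allowed(allocation_dollars, services, transportation):
    """Check a proposed split between services and transportation."""
    limits = set_aside_limits(allocation_dollars)
    return (services + transportation == limits["total_set_aside"]
            and services >= limits["minimum_each"]
            and transportation >= limits["minimum_each"])


# Hypothetical Title I allocation of $1,000,000.
limits = set_aside_limits(1_000_000)
```

With a $1,000,000 allocation, $200,000 is set aside, at least $50,000 goes to each purpose, and the remaining $100,000 can be divided between them as local needs dictate.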

After four years of failure to make adequate yearly progress, districts are required by law 
to implement corrective action in the school. This means that, in addition to continuing 
the provision of public school choice and supplemental services, districts must intervene 
more forcefully. This could mean removing school staff, changing school leadership, or 
altering curriculum and programs. Finally, to stem the tide of continuous failure, any 
schools that fail to make adequate yearly progress for five consecutive years will be 
completely restructured. This might mean a state takeover, alternative governance, 
private management, new staff, or becoming a charter school. In essence, they will begin 
anew. 

Schools will be released from the “corrective action” category only after making 
adequate yearly progress for two consecutive years. 
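
The escalating timeline described above can be summarized as a simple lookup. The sketch below restates the consequences in the order the text gives them; the function name and the wording of each entry are illustrative paraphrases, not statutory language.

```python
def consequences(consecutive_years_failed):
    """List the cumulative consequences after N straight years without AYP."""
    steps = []
    if consecutive_years_failed >= 2:
        steps.append("identified as needing improvement; district turnaround plan")
        steps.append("public school choice, with transportation paid by the district")
    if consecutive_years_failed >= 3:
        steps.append("Title I-funded supplemental services for disadvantaged students")
    if consecutive_years_failed >= 4:
        steps.append("corrective action: staff, leadership, or curriculum changes")
    if consecutive_years_failed >= 5:
        steps.append("restructuring: takeover, new governance, or charter conversion")
    return steps


# Consequences are cumulative: each stage adds to the ones before it.
after_four_years = consequences(4)
```

The list grows rather than resets, which mirrors the text: choice and supplemental services continue even after corrective action begins.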

With the enactment of NCLB, these consequences go into immediate effect for schools 
that have already been identified as in need of improvement under the IASA. These 
schools — some 6,700 of them 8 — are considered to be in their first year of school 
improvement (in 2001-2002) and must offer public school choice in the coming school 
year (2002-2003). Likewise, the 3,000 schools that are already in their second year of 
school improvement under the previous law must provide individual student services to 
supplement the regular school day in addition to public school choice for all low-income 
students in the coming year. This means students who have been in schools identified as 
failing for two or three years will receive immediate help through NCLB. The clock does 
not start over for these students, and failing schools do not receive an amnesty period 
simply because the law changed. 

Just as schools are held to showing results under the AYP process, so too are school 
districts and, ultimately, the state. The state, usually through its state department of 




8 House Committee on Education and the Workforce, Press Release: H.R. 1 Education Reforms Would 
Mean Immediate New Options for Students In Thousands of Failing Schools — Beginning in 2002, 
December 13, 2001. 